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ABSTRACT 

Motivation: Systematic and scalable parameter estimation is a key to 
construct complex gene regulatory models and to ultimately facilitate 
an integrative systems biology approach to quantitatively understand 
the molecular mechanisms underpinning gene regulation. 
Results: Here, we report a novel framework for efficient and scalable 
parameter estimation that focuses specifically on modeling of gene 
circuits. Exploiting the structure commonly found in gene circuit 
models, this framework decomposes a system of coupled rate equa- 
tions into individual ones and efficiently integrates them separately to 
reconstruct the mean time evolution of the gene products. The accur- 
acy of the parameter estimates is refined by iteratively increasing the 
accuracy of numerical integration using the model structure. As a case 
study, we applied our framework to four gene circuit models with 
complex dynamics based on three synthetic datasets and one time 
series microarray data set. We compared our framework to three 
state-of-the-art parameter estimation methods and found that our ap- 
proach consistently generated higher quality parameter solutions effi- 
ciently. Although many general-purpose parameter estimation 
methods have been applied for modeling of gene circuits, our results 
suggest that the use of more tailored approaches to use domain-spe- 
cific information may be a key to reverse engineering of complex bio- 
logical systems. 

Availability: http://sfb.kaust.edu.sa/Pages/Software.aspx 
Contact: xin.gao@kaust.edu.sa 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 

1 INTRODUCTION 

A quantitative understanding of how expression of genes is con- 
trolled in time and space through the integration of computa- 
tional and experimental methods is a main goal of molecular 
systems biology (Church, 2005; Ideker et aL, 2001; Kitano, 
2002). Among the major obstacles in such an integrative systems 
biology approach is the construction of kinetic models that quan- 
titatively support the current knowledge of a given gene circuit. 
What makes the construction of gene circuit models especially 
difficult is the quantification of all reaction parameters, as direct 
measurements of gene regulation kinetics are seldom available. 
Thus, model parameters are often estimated indirectly using 
more readily available experimental data [e.g. Schoeberl et al. 
(2002); Zwolak et al. (2005)]. Even in modeling of relatively 
well-known gene circuits, such as the phage-}^ lysis-lysogeny 
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developmental pathway (Arkin et al, 1998), there are a 
number of unknown parameters, which are phenomenologically 
determined by fitting the model's outputs to some experimental 
observations. 

The quahty of time series gene expression data is crucial to the 
construction of phenomenological models that accurately cap- 
ture the observed dynamical characteristics of a given gene cir- 
cuit. With advances in the gene expression detection 
technologies, single-molecule level measurements of gene expres- 
sion can now be obtained in a wide range of organisms (Baugh 
et al, 2011; Cai et al, 2006; Golding et al, 2005; Materna et al., 
2010; Newman et al, 2006; Suter et al, 2011; Taniguchi et al, 
2010; Zenklusen et al, 2008). In particular, recent advances in 
fluorescence imaging techniques (Joo et al, 2008; Raj and van 
Oudenaarden, 2009) facihtate real-time measurements of gene 
expression at the single-molecule level, making more accurate 
parameter estimation for quantitative modeling of gene circuits 
possible. Such single-cell gene expression data are, however, 
noisy because of intrinsic and extrinsic fluctuations (Cai et al, 
2006; Elowitz et al, 2002; Newman et al, 2006; Golding et al, 
2005; Raj et al, 2006; Raser and O'Shea, 2005; Suter et al, 201 1; 
Taniguchi et al, 2010) and often limited to lower concentration 
molecular species such as mRNAs (Kulkarni, 2011; van Oijen, 
2011). Because of such noisy gene expression and highly non- 
linear dynamics involved in transcriptional regulations, manual 
parameter estimation in nontrivial gene circuit models is gener- 
ally infeasible. 

To systematically estimate the parameters of a biochemical 
kinetic model, the parameter estimation problem is often treated 
as an optimization problem in which parameter values are se- 
lected to minimize a certain objective function (Schwartz, 2008). 
Although several stochastic optimization and Bayesian-based 
methods were successfully apphed to estimate parameters of bio- 
chemical models (Baker et al, 2010; Moles et al, 2003), they 
often suffer from scalabihty problems when there are a large 
number of unknown parameters. To make the estimation of par- 
ameters more efficient, several methods have been proposed to 
reduce the parameter search space by decomposing rate equa- 
tions (Jia et al, 2011; Koh et al, 2006; Zhan and Yeung, 2011). 
However, the quality of these methods strongly depends on in- 
terpolation and smoothing functions, which are often independ- 
ent of the underlying model structure and can add strong 
artifacts. Recently, Kalman filter-based approaches, which can 
alleviate the scalability problem, were apphed to efficiently esti- 
mate kinetic parameters (Lillacci and Khammash, 2010; Quach 
et al, 2007; Sun et al, 2008). Although these approaches support 
parameter estimation of models with unobserved variables, a 
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recent comparative study showed that their performance could 
be sensitive to the initial condition, and estimated parameters 
might be far from the true ones if the initial guess was not 
close to the solution (Liu and Niranjan, 2012). Most of these 
existing methods are applicable to parameter estimation prob- 
lems of generic dynamical models, as they do not demand any 
domain- specific knowledge. Although these general-purpose 
methods can easily be applied to modeling of any biological 
systems, it is clear that each of these methods has its advantages 
and disadvantages, and that no single method is versatile enough 
to efficiently give optimal parameter sets for all biological 
models. This observation has led us to develop a more tailored 
parameter estimation method that focuses on a specific yet im- 
portant subclass of biological models, namely, gene circuit 
models. 

To facilitate the mechanistic construction of thermodynamics- 
based models (Shea and Ackers, 1985; Sherman and Cohen, 
2012) that describe the quantitative behavior of gene regulation 
from time series mRNA data, we developed a novel parameter 
estimation framework called Parameter Estimation by 
Decomposition and Integration (PEDI) that specifically focuses 
on modeling of gene circuits. The main paradigm of PEDI is 
'divide' and 'conquer'; by using the given mRNA data and 
exploiting the structure of gene circuit models, our framework 
divides a high-dimensional parameter estimation problem into 
subproblems with a much smaller parameter space, each of 
which is, in turn, conquered (i.e. solved) by using any constrained 
optimization method. At the initial step, this problem reduction 
process leads to a crude linearization for numerical integrations, 
which often results in poor estimates especially for highly non- 
linear systems. To improve the quality of the estimate with a 
basically negligible increase in computing time, PEDI places 
intermediate integration points using the underlying structural 
information of a given gene circuit model and iteratively in- 
creases the accuracy of these intermediate points to increase 
the accuracy of the numerical integration, which in turn im- 
proves the reconstructed dynamics. This article introduces 
PEDI and, through the use of simulated annealing (SA) as the 
optimization method, applies the framework to three-gene circuit 
models with complex dynamics based on synthetic time series 
mRNA datasets and one yeast gene circuit model based on 
time series microarray data. We compared PEDI with three 
state-of-the-art parameter estimation methods, namely, the evo- 
lutionary strategy with stochastic ranking (SRES) (Runarsson and 
Yao, 2000), the moment matching method coupled with hybrid 
extended Kalman filter (HEKF + MM) (Lillacci and Khammash, 
2010) and the two-phase dynamic decoupling method (TDDM) 
(Jia et ai, 2011). Our results show that PEDI consistently pro- 
duced the most accurate estimates efficiently in all the four par- 
ameter estimation experiments. This study, thus, demonstrated 
that PEDI could provide an effective approach to efficiently 
estimating kinetic parameters of gene circuit models. 

2 METHODS 

2.1 Problem setting 

We concern ourselves with time series gene expression data generated 
from an A^-gene network at equally spaced M + 1 time points, 



to<ti • • • <tM- Gene gi is transcribed into mRNA m,, which is then 
translated into protein pi, which can then be used to regulate the tran- 
scription of genes in the network. We denote by my and Py random 
variables representing the levels of the mRNA copy and the protein 
copy of gene gi at time tj, respectively. We further assume that these 
random variables be expressed as follows: 

pg = fjip.. + Uij, for / = 1, . . . , and 7 = 0, . . . , M, 

where iim.. and iip.. are the true mean of m^- and Py, respectively, whereas 
each of Vy and Uy is a statistically independent random variable with mean 
0. We consider that only the levels of mRNAs are observable from the 
experiments, but we assume that the true mean of each protein pi be 
known at time Iq. 

Here, we are interested in constructing a kinetic model that estimates 
the average trajectory of mRNAs given by and we do not focus on 
the time evolution of higher moments, as experimental time-series data 
often contain only few datasets. Our model describes the average time 
evolution of a gene circuit as a continuous-time deterministic process, 
which is governed by a system of ordinary differential equations 
(ODEs) as follows: 

drhi 

—— = hi(p; ei)-eami, 
— = airui - ^ipu 

with the initial conditions: 

m,(/o) = m/o, 

Piito) = fip.,, for / = 1, ...,A^. 

Here, m, and pj are time-dependent variables that estimate the dynamics 
of /x,„. and /x^.., respectively; 6i = (6a, . . . , OjK/) is a Kj dimensional vector 
that represents the parameters used in the rate equation representing the 
regulation of mRNA m,-; hf is the transcription rate function based on the 
equilibrium thermodynamics model of the cis regulation of gene gf, p is an 
dimensional vector whose i-th element is pf, ai and are the param- 
eters used in the regulation of protein Pi, which we assume to be known; 
and fnij is the sample mean of m,- at time tj. The regulation of each gene is 
modeled using four reaction processes, transcription, mRNA degrad- 
ation, translation and protein degradation, whereby transcription is con- 
sidered to be the main regulatory step. Using this model, our objective is 
to search for values of the unknown parameters, which minimize a 
weighted sum of squared residuals of the sample means of mRNAs. As 
a system of coupled nonlinear rate equations can seldom be solved ana- 
lytically, the numerical integration via simulation is usually used to esti- 
mate the levels of mRNAs for a given parameter value. When the 
dimension of the unknown parameters is high, however, finding practical 
solution for 6 = {6\, ...,6^) based on simulation for each parameter 
change becomes computationally intensive, and such an approach even- 
tually proves to be infeasible. 

2.2 Overview of PEDI 

Figure 1 illustrates a high-level workflow of PEDI. The main idea of 
our framework is to optimize parameters separately by using what is 
available rather than dealing with exponentially larger parameter space 
involved in the optimization of 6. The framework takes advantage of the 
fact that the rate functions in our gene circuit models have the following 
structure: 

t=M-ky, (2) 
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Fig. 1. An illustration of the workflow of PEDI. Briefly, given a model 
structure and time series mRNA data at M + 1 time points, it first makes 
a linear assumption and estimates the proteins data points, which are then 
used to estimate parameters for the mRNA regulations. These initial 
estimates are then iteratively refined by placing + 1 mRNA integra- 
tion points and Lp + \ protein integration points and by increasing the 
accuracy of these integration points 



whose definite time integral from time to time tj has the following form: 

Jtj-i 



(3) 



where tj-\ <tj. As the transcription rate functions depend on regulatory 
proteins, our parameter estimation framework based on the decompos- 
ition of a gene circuit model requires the estimate of the protein levels 
first. To this end, PEDI uses the time series sample average of the mRNA 
and makes a linear assumption to estimate the mean time evolution of 
each protein level. However, as gene circuits often involve highly non- 
linear reactions, such a crude linear interpolation may result in an inad- 
equate parameter estimation. To refine the quality of the parameter 
estimation, the framework enriches the number of the data points by 
estimating intermediate points of the observed data points using the 
output from a computational simulation. These intermediate data 
points are then used to make the interpolation of the observed data 



points and the numerical integration of the rate functions more accurate. 
The introduction of these intermediate data points does not increase the 
complexity of the parameter search space, as they are only used for nu- 
merical integrations. By repeating this process, PEDI attempts to increase 
the accuracy of the interpolation and the fitness of the parameter estima- 
tion. Thus, PEDI can efficiently perform parameter estimation by avoid- 
ing computationally intensive search in a high-dimensional parameter 
space while keeping the quality of the parameter estimation high. 

As PEDI decomposes a system of ODEs into individual ODEs, it has 
an objective function for each mRNA m,. The form of the objective 
function of mRNA is a weighted sum of squared residuals. More 
detailed information on the objective functions in PEDI is described in 
Supplementary Section SI. 

2.3 Initial optimization process 

PEDI decomposes a gene circuit model into individual rate equations. 
This process involves uncoupling of coupled rate equations. To estimate 
the time evolution of the mRNAs from the decomposed rate equations, 
we first need to estimate the time evolution of the transcription factors of 
each gene in the model. Thus, the first step of our framework is to gen- 
erate the initial estimate of each protein copy at the M time points (i.e. ti 
to Im)- To this end, we estimate pi{tj) by applying the time-integral form 
in Equation (3), using the time series sample average of m,- and using the 
trapezoidal rule to approximate the numerical integration. This estimates 
the mean levels of the transcription factors of each gene gj at the M time 
points, making the evaluation of the transcriptional kinetic function of 
each mRNA mi at the M time points possible. 

Using the initial estimates of the protein levels at the M time points, 
PEDI sets out to estimate the mean time-course of m, by optimizing the 
value of To estimate the mean time evolution of m,-, we once again use 
the time-integral form in Equation (3) and apply the trapezoidal rule to 
approximate the integration of the rate equation of m, . This approximate 
integration is used to compute for each / with a given parameter 
combination, which is then used in a metaheuristic optimization — such 
as SA and genetic algorithms — to test the fitness of each parameter com- 
bination and to find the initial estimate of the optimal 6i. This parameter 
optimization process is largely independent of the values of the param- 
eters in a model and remains efficient even when a combination of the 
parameter values makes the timescale of some rate equations widely dif- 
ferent and the ODEs stiff 

While facilitating an efficient and scalable parameter estimation, a 
model decomposition involving the linear approximation of the time in- 
tegral of each rate equation may not result in a high-quality estimate, 
especially when the model of interest is highly nonlinear or the given time 
series data are sparse. In addition, such a linearization inevitably intro- 
duces integration errors, making the assessment of the prediction error 
for each parameter combination difficult. More detailed information 
on the initial optimization process is described in Supplementary 
Section SI. 

2.4 Parameter estimation refinement 

To improve the accuracy of the numerical integration and the parameter 
estimation, PEDI next performs a simulation of the ODE model, given 
the current estimate of ^. This simulation-based numerical integration not 
only gives a much more accurate picture in terms of the performance of 
the current estimate but also generates an arbitrary number of data points 
for each rht and pi. From this simulation, we generate + 1 equally 
spaced m,- data points between and Im for each mRNA, where we set 
Lm = CmM for some integer Cm>\. The value of may come with 
additional constraints depending on the choice of a numerical integration 
method. 

By using the simulated mRNA data points, PEDI attempts to better 
estimate the time evolution of proteins than the simple linear 
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interpolation that is used in the initial estimate. To this end, we first 
adjust the simulated data of so that they can better reflect the time 
evolution of the sample mean of each m^-. Let + I be the number 
of simulated mRNA data points in each time interval between time 
points tj and tj+i (i.e. = Cm). Then, to estimate the d^ + 1 mRNA 
points in this time interval, we adjust every simulated data point between 
tj and tj+\ by considering the difference between the sample mean and the 
simulated data point of mRNA m^. Specifically, by letting A/ be the time 
interval between tj and tj^ i and rhj be a time-dependent variable that 
represents the + 1 adjusted data points, we express m,(// + ^,„A/) 
for all qm e {0, l/d,n, 2/ dm, . . . , 1} as follows: 



m{tj + qm^t) = rkiitj + ^„,A0 + (1 - qm)rj + qm 



7+1' 



(4) 



where Fj is m^- — rkiitj). This definition makes sure that, at each time point 
tj, we have Tn(tj) = niij. This allows us to use the additional data points 
from rfii to make the interpolation of and, in turn, the estimation of 
the time evolution of pj more accurate than the ones based on a simple 
linearization. 

By using the + 1 data points of between to and tu, we generate 
equally spaced Lp -\- I data points for each pf. Here, we require that 
Lp = dpM be smaller than where dp is a positive integer. We de- 
note by Pf a time-dependent variable that represents the Lp -\- I data 
points of Pi. To compute the values of p^ at the Lp + 1 time points, we 
first set Pi(to) to be /x^.^ . Next, we iteratively compute the next data point 
of p) for the other Lp time points. To this end, we integrate the rate 
equation of pi between each time interval At /dp using the L^/Lp data 
points of rfii within this time interval (see Supplementary Section SI for 
details). 

Using the Lp + \ time points of the newly generated protein variables, 
p{t) = ( Piit), . . . , p^it)), we can better interpolate the dynamics of the 
transcription factors of gene gj and estimate the dynamics of mi(tj) from 
the decomposed rate equation of for a given parameter combination of 
6i. Thus, applying this approach for the calculation of within an opti- 
mization method, we can search for a parameter combination 9i of 9i that 
increases the quality of the estimate (see Supplementary Section SI for 
details). 

By using 6 = (6i, • • • , 6^) generated from this optimization, we simu- 
late the model and calculate the sum of We repeat the parameter 
refinement process until a given termination condition is satisfied (e.g. 
until the value of the sum of stabilizes). For the next iteration of the 
refinement process, if the current error is smaller than the previous one, 
we use the current 6 as the seed parameter values for the next iteration. 
Otherwise, we select the current 9 over the previous one at probability of 
/'waxexp(l — 6c/6o) whcrc Pmax IS the maximum probability of choosing 
the current estimate, 6^ is the current sum of Jt and 6^ is the previous one. 
That is, if the error from the current 6 is worse than the previous one, the 
probability of accepting the current 9 for the next round becomes smaller. 
The detailed information of specific configurations of PEDI used in 
the Section 3 of this article is described in Supplementary Sections SI 
and S2. 



2.5 Prediction error 

Optimization-based parameter estimation methods may have different 
objective functions. To compare the accuracy of estimated parameters 
of various parameter estimation methods objectively and without depend- 
ing on any specific objective functions, we define the prediction error of 
the i-th mRNA as follows: 



PEi : 



M 



\mi{ tj)-mij\ 



(5) 



where e is a small fixed value and the prediction error of the model as 
follows: 



PE=^ PEi. 



(6) 



In other words, we defined the prediction error to be the sum of the dif- 
ference between the sample mean and the estimate with respect to the 
difference between the sample mean and the true mean at each time point 
and for each mRNA. As this definition of prediction error depends on 
the true mean of mRNAs — ^whose values are hidden from objective 
functions — this prediction error can be more objective to compare par- 
ameter estimation methods than using a specific objective function 
(e.g. sum of mean squared error). However, this definition can only be 
used when mRNA data are synthesized from a model, as the true mean 
values are not available in real biological systems. In this study, we set s 
to be 0.1. 



3 RESULTS 

3.1 Models 

To test the performance of our parameter estimation framework, 
we constructed three different gene circuit models, M 1 , M2 and 
M3 (see Fig. 2). The first system is a three-gene circuit (Fig. 2A). 
In this system, the transcription of gene gi is upregulated by 
protein p2 and downregulated by protein p^. The transcription 
of gene g2 is repressed by protein pi, forming a negative feedback 
loop of gene g2. Such a regulatory structure can be seen, for 
example, in the phage-}^ lysis-lysogeny decision circuit in which 
CII upregulates synthesis of CI, CI in turn downregulates syn- 
thesis of C//and Cro downregulates synthesis of C/(Arkin et ai, 
1998). Provided that the level of protein p2 is high and the level of 
protein pi is low, this system exhibits a complex transient behav- 
ior. In this setting, protein pi initially increases rapidly because of 
the upregulation facilitated by protein p2, and this increase in 
protein p^ downregulates gene g2, leading to a rapid decrease 
in protein p2, which, in turn, downregulates gene gi and so on. 
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Fig. 2. The schematics of the three gene circuits used in this study. (A) 
The gene circuit structure of model M. 1 . (B) The three-gene repressilator 
structure represented in model M2. (C) The seven-gene repressilator 
structure represented in model M3 
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We modeled this gene circuit by the foUowing system of ODEs, 
which we refer to as model M 1 : 

dm I hikxpif 

dmi ks 
~df^\+{k^p,r~'^^'^^' 
dm^ , 

-— = ke- y^m^, 
dt 

^ = aifHi — PiPi, for / = 1, 2, 3. 
at 

In this model, we treated the equilibrium rate constants and 
the maximum transcription rates: ki, ...,ke as unknown and 
estimated them from synthetic time series mRNA data by 
adding a Gaussian noise to the simulated data. We simulated 
this model from time 0 to 1500 time units and sampled mRNA 
data at 3 1 time points. The initial condition for the simulation is 
m\(0) = 0, m2(0) = 500 and m3(0) = 0, and all protein mol- 
ecules are initially set to be absent in the system. We note that 
this parameter estimation problem has an infinite number of 
suboptimal solutions. This is because the transcriptional regula- 
tion kinetic function of pi can be simplified to 
k2>{k\p2f^ Ii{k\p2f^ +{k2P2>Y^) if k\ x p2 or k2 x /?3 is assumed 
to be always much greater than 1. In such a case, there is an 
infinite number of combinations of k\ and k2 with the same 
k^2 lk\^ ratio that produce the same dynamics. As a result, we 
could obtain an infinite number of optimal solutions if this as- 
sumption were to be satisfied. However, as the initial amounts of 
proteins p2 and p^ are zero, those solutions may just be subopti- 
mal and may not capture the initial transient behavior well. 
Thus, it is challenging to find an optimal solution that can cap- 
ture the initial transient behavior without being stuck in one of 
those suboptimal solutions. Supplementary Table SI shows the 
values of the parameters used in the simulation. 

The second and third models are both based on a gene circuit 
structure, which has a potential to exhibit a sustained oscillation. 
This gene circuit is called repressilator, which was synthetically 
constructed to exhibit an oscillatory behavior based on transcrip- 
tion regulation with cyclic repression (Elowitz and Leibler, 2000). 
We model the mean time evolution of w-gene repressilator by the 
following system of ODEs: 

dm _ kpi 

dt I -\- (keiP(i-l)) ' 

^ = ami — PPi, fori= 1 , 

where po is equivalent to Pn. By having an odd number of inter- 
acting genes, the repressilator can exhibit an oscillation under 
specific parameter conditions. Here, we constructed two repres- 
silator models with a different number of genes. One is a three- 
gene model, which we refer to as model M2, and the other one is 
a seven- gene model, which we refer to as model M3 (Fig. 2B and 
C). Given the parameter combinations that we selected (see 
Supplementary Tables S2 and S3), these two models exhibit os- 
cillatory dynamics as the eigenvalues of the Jacobian matrix at 
the fixed point contain complex numbers. We set the initial 



condition of model M2 to be mi(0) = 50, pi(0) = 500, and the 
other molecular species to be initially absent in the system. For 
the initial condition of model A^3, we set mi(0) = 20 and the 
other molecular species to be initially absent in the system. To 
generate the dataset for the true mean trajectory of the mRNAs 
in the two-gene circuit models, we simulated each deterministic 
model and sampled the mRNA levels at 3 1 equally spaced time 
points. Using these true mean trajectory datasets, we later added 
the noise term for each data point for each sample and generated 
the sample mean of each mRNA. 

3.2 Comparison using synthetic data 

In this study, we used as the optimization component in PEDI an 
adaptive SA algorithm (Kirkpatrick et ai, 1983) in which the 
parameter to control the temperature schedule can be changed. 
To measure the improvement made by PEDI, we also used the 
SA algorithm — without the PEDI framework — for the param- 
eter estimation of models Ml and M2 and compared the per- 
formance between PEDI and SA (see Supplementary Section 
S3). Our results demonstrated that PEDI improved the consist- 
ency and the accuracy of parameter estimation while increasing 
the runtime efficiency. 

Next, we compared PEDI with three state-of-the-art param- 
eter estimation methods, namely, the SRES (Runarsson and 
Yao, 2000), the HEKF + MM (Lillacci and Khammash, 2010) 
and the TDDM (Jia et al., 2011). SRES is an evolutionary opti- 
mization algorithm, which was reported to be among the best 
candidates for parameter estimation of biological models in pre- 
vious comparative studies (Moles et ai, 2003; Sun et al., 2012). 
HEKF + MM is a hybrid parameter estimation method, which 
first applies the hybrid extended Kalman filter and then, if ne- 
cessary, applies the moment-matching method as the refinement 
step. TDDM is another model decomposition-based method, 
which avoids costly ODE simulations by estimating the slopes 
of smooth piecewise polynomial functions that interpolate 
observed data. While PEDI, HEKF and TDDM were all imple- 
mented in Matlab, the moment-matching method was imple- 
mented in C. SRES was implemented in Matlab, but it calls a 
C library for stochastic ranking computations. Thus, we expected 
that the efficiency of HEKF + MM and SRES might be over- 
estimated from the comparisons based on computational time. 

To generate a sample data point of a given mRNA at a given 
time point, we sampled a value by adding to the true mean of 
mRNA a Gaussian random variable with mean 0 and variance 
being the time average of the true means of this mRNA. For the 
experiments with models M\ and M2, we generated four time 
series data samples and used the average of the four as the 
observed dataset. To analyze the performance of parameter es- 
timation methods with a time series dataset at a higher noise 
level, we generated only a single time series data sample and 
used this as the observed dataset in the experiments with 
model M3. For the parameter estimation of each of the three 
models, we ran each method 10 times. Detailed information 
about the specific settings used in the three parameter estimation 
methods in this comparison is described in Supplementary 
Section S4. 

To evaluate the performance of each method, we used four 
basic criteria, the computational efficiency, the quality of 
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reconstructed dynamics, the accuracy of estimated parameters 
and the quahty of predictability. The computational efficiency 
was measured by computing the average runtime of the 10 runs 
from each method. The quality of reconstructed dynamics was 
analyzed by measuring the average, the smallest and the largest 
prediction errors of each method, whereas the accuracy of esti- 
mated parameters was analyzed by measuring the average rela- 
tive error of the estimated parameter set with the smallest 
prediction error for each method. Finally, the quality of predict- 
ability was analyzed by extrapolating mRNA data at the next k 
observed time points using the parameter set with the smallest 
prediction error of each method; we measured the mean squared 
distance between the estimated data points and 100 samples that 
are generated for each of the k observed data points where we set 
to be 1, 3 and 5. 

The results from the comparison of the four methods using 
model Ml are summarized in Table 1. Here, PEDI outper- 
formed the other methods in three of the four criteria. Both 
SRES (with 100 generations of evolution) and TDDM per- 
formed poorly in terms of efficiency and accuracy. Although 
HEKF + MM was the most efficient method in this experiment, 
PEDI was also relatively efficient (0.5 min versus 2.2 min). By 
comparing the best parameter solutions of these two methods, 
we found that PEDI generated the most accurate estimate (Fig. 
3 A and Supplementary Fig. S9). HEKF + MM produced par- 
ameter sets with negative values in 7 of the 10 runs. This is due to 
the fact that the moment-matching algorithm used an uncon- 
strained local optimization technique (Lillacci and Khammash, 
2010). To analyze the typical behavior of each method, we 
measured the average prediction error and the average relative 
parameter error (Table 1 and Supplementary Fig. SIO). These 
show that PEDI generated not only the best parameter solution 
but also the highest quality parameter solutions on average. 
PEDI was also able to extrapolate the mRNA levels at next 
few time points more accurately than the other three. Taken 
together, we found that PEDI was able to generate the 
highest quality parameter solutions efficiently among the four 
methods. 



Table 1. Comparison of the results from model M 1 



Next, we analyzed the performance of the four methods using 
model M2. These results are summarized in Table 2. In this 
experiment, PEDI outperformed the other methods in three of 
the four criteria. Although HEKF + MM was once again the 
most efficient method in this experiment, both PEDI and 
TDDM had comparable speed with HEKF + MM. Again, 
PEDI substantially outperformed the other methods in terms 
of the quality of the estimates in this experiment. By comparing 
the best parameter solution of each method, we found that PEDI 
generated high-quality estimates with the smallest prediction 
error (Fig. 3B and Supplementary Fig. Sll). Even the worst 
parameter solution of PEDI had a lower prediction error than 
the best solution of any of the other methods (Table 2). To 
analyze the typical behavior of each method, we measured the 
average prediction error and the average relative parameter error 
of the best parameter solution (Table 2 and Supplementary 
Fig. SI 2). These show that PEDI consistently outperformed 
the other methods and produced much higher quality parameter 
solutions in a computationally efficient fashion. Furthermore, 
PEDI was also able to extrapolate the mRNA levels at next 
few time points substantially more accurately than the other 
three. 

We next applied the four methods to model M3. The com- 
parison of the four methods is summarized in Table 3. Among 
the four methods, the two model decomposition-based methods 
were much more efficient than the other two methods. TDDM 
was the fastest with its average runtime being 26 min, and PEDI 
was a close second with its average runtime being 33.5 min. 
Although HEKF + MM was the most efficient method for the 
2 three-gene models (Tables 1 and 2), it was more than four times 
slower than PEDI in this experiment. These results, coupled with 
the results from models M 1 and M2, show that model decom- 
position-based methods can scale better than typical parameter 
estimation methods. The least computational efficient method 
was SRES. We ran SRES with 2000 generations of evolution, 
which, on average, took more than eight times longer than PEDI 
did. However, the quality of the estimates from SRES was just 
on a par with that of TDDM. 

In the other three criteria, PEDI outperformed the other three 
methods substantially. In terms of the quahty of estimated par- 
ameters, PEDI performed substantially better than the other 
three methods. By comparing the best parameter solution of 



Method 


PEDI 


SRES^ 


HEKF + MM 


TDDM 


Runtime 


2.2 min 


8.1 min 


0.5 mm 


11.3 min 


Average PE 


225.9 


1584.0 




2145.0 


Best PE 


127.5 


836.5 


207.0 


2144.9 


Worst PE 


359.5 


2424.5 


N/A*^ 


2145.1 


Best param'^ 


0.046 


8490.5 


0.12 


9382.6 


Pred(l)^ 


33.6 


378.8 


40.0 


3095.9 


Pred(3)^ 


31.8 


410.4 


36.3 


2831.9 


Pred(5)^ 


33.4 


392.8 


38.6 


2620.7 



'^With 100 generations. Because seven runs resulted in negative parameter values. 

average relative error of the best parameter solution. '^Pred(k) indicates the 
mean squared distance of the next k time points. The comparison criteria are as 
follows: the average runtime; the average, best and worst prediction errors; the 
average relative error of the best parameter set; and the quality of data extrapola- 
tion. Each bold face indicates the best among the four. 



PEDI (64.92) 
HEKF+MM (429.01) 
SRES (735.61) 
TDDM (807.46) 




Fig. 3. Comparison of the four methods based on the reconstructed dy- 
namics with the smallest prediction error from models M.I and M.2. (A) 
The results of mi in A41. (B) The results of mi in M2. Here, each value 
within parentheses next to each method indicates the prediction error for 
a given mRNA, the red square points indicate the observed data points, 
and the dotted red lines indicate the true average trajectories 
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Table 2. Comparison of the results from model M2 



Method 


PEDI 


SRES^ 


HEKF + MM 


TDDM 


Runtime 


IS.Omin 


268.7 min 


10.6 min 


10.9 min 


Average PE 


781.4 


2119.9 


N/A*' 


2197.1 


Best PE 


227.0 


1854.0 


2964.2 


2196.8 


Worst PE 


1590.9 


2502.7 


N/A^ 


2197.2 


Best param'^ 


0.091 


1592.7 


0.64 


0.45 


Pred(l)'^ 


26.2 


1139.2 


1164.0 


3021.6 


Pred(3)^ 


26.7 


893.0 


1274.9 


2501.2 


Pred(5)'^ 


22.9 


767.7 


1099.4 


1765.0 



^With 400 generations. "^Because nine runs resulted in negative parameter values. 
'^The average relative error of the best parameter solution. '^Pred(k) indicates the 
mean square distance of the next k time points. The comparison criteria are as 
follows: the average runtime; the average, best and worst prediction errors; the 
average relative error of the best parameter set; and the quality of data extrapola- 
tion. Each bold face indicates the best among the four. 



Table 3. Comparison of the results from model M.3 



Method 


PEDI 


SRES^ 


HEKF + MM 


TDDM 


Runtime 


33.5 min 


279.5 min 


143.5 min 


26.0 mm 


Average PE 


1123.5 


2399.6 


N/A^ 


2250.1 


Best PE 


647.1 


2174.6 


1477.2 


2195.9 


Worst PE 


2215.5 


2735.5 


N/A^ 


2587.5 


Best param'^ 


0.16 


1.6 


0.36 


0.27 


Pred(l)'^ 


112.4 


595.3 


176.1 


558.5 


Pred(3)^ 


91.1 


525.0 


150.7 


577.6 


Pred(5)'^ 


103.5 


461.1 


182.2 


569.0 



^With 2000 generations. "^Because four runs resulted in negative parameter values. 
'^The average relative error of the best parameter solution. '^Pred(k) indicates the 
mean squared distance of the next k time points. The comparison criteria are as 
follows: the average runtime; the average, best and worst prediction errors; the 
average relative error of the best parameter set; and the quality of data extrapola- 
tion. Each bold face indicates the best among the four. 



each method, PEDI came out to be the most accurate one with 
its prediction error being at least twice as good as the other 
methods. PEDI outperformed the other methods in terms of 
the accuracy of both the reconstructed dynamics and the param- 
eter values (Fig. 4 and Supplementary Fig. SI 3). In addition, the 
average parameter solution of PEDI had at least 30% lower 
prediction error than the best parameter solution of any other 
(Table 3). Furthermore, the parameter sets from PEDI were the 
closest to the true parameter set on average, and the best par- 
ameter solution from PEDI had only 16% error to the true par- 
ameter set (Supplementary Fig. S14 and Table 3). PEDI was also 
able to extrapolate the data at the next time points much more 
accurately than the other three. As data extrapolation of bio- 
chemical dynamics — especially those with transient behaviors — 
is a challenging problem, this highlights the importance of high- 
throughput parameter estimation methods, which can generate 
high-quality parameter sets in a timely fashion to ultimately fa- 
cilitate construction of a predictive model for given biological 
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PEDI (61.8) 

^—HEKF+MM (180.28) 

SRES (228.72) 

■ TDDM (454.64) 




Fig. 4. Comparison of the four methods based on the estimated param- 
eter set with the smallest prediction error from model M3. This shows the 
results for mRNAs mi, m2 and m^. The left-hand side panels show the 
comparison for the reconstructed trajectory. The numbers in the parenth- 
eses indicate the prediction error. The dotted red curve shows the true 
trajectory, whereas the red square points indicate the synthetic data 
points. The right-hand side panels show the comparison of the parameter 
combination generating the estimated trajectory for the four methods. 
Here, the red point in the middle of each side for the parameter compari- 
son indicates the true value of each parameter 



phenomena. To test whether our performance results remain 
intact in parameter estimation of a variant of seven-gene repres- 
silator model, we modified A^3 by adding a repression connec- 
tion from gs to g6- Our results show that PEDI produced higher 
quality parameter solutions much more efficiently than the other 
methods (see Supplementary Section S5). Taken together, we 
found that PEDI was able to consistently produce high-quality 
parameter estimates under various conditions in a computation- 
ally efficient matter. 

3.3 Parameter estimation with yeast microarray data 

To compare the performance of the parameter estimation meth- 
ods using experimental data, we used time series microarray ex- 
periments of the genomic expression patterns in the yeast 
Saccharomyces cerevisiae responding to several environmental 
changes (Gasch et al, 2000). We modeled a gene circuit involving 
genes GCN4, LEU3 and ILV5. GCN4 is a master regulator of 
many genes including those for the amino acid biosynthesis path- 
way (Natarajan et al, 2001). LEU3 is a gene encoding a 
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transcription factor that regulates genes involved in amino acid 
biosynthesis, whereas ILV5 encodes an enzyme that catalyzes 
amino acid biosynthesis (Friden and Schimmel, 1988; 
Zelenaya-Troitskaya et aL, 1995). 

The network structure of these three genes is reported to 
follow the network motif called the coherent type 1 feed-forward 
loop (Mangan and Alon, 2003). We described the model of this 
feed-forward gene circuit by the following system of ODEs: 



^ = P(mi-pd, for/= 1,2,3, 

where /i is a piecewise polynomial function of t and m\,m2 and 
m3 are the mRNA copies of GCN4, LEU3 and ILV5, respect- 
ively, whereas P\,P2 and />3 are proteins Gcn4p, Leu3p and Ilv5p, 
respectively. As in Cao and Zhao (2008), we estimated the time 
evolution of m\ with a smoothing method based on spline. 

In this experiment, we set a = 1 and = 0.5 to have protein 
stability higher assuming that each transcription rate is close to 
the maximum value when the corresponding expression profile is 
at the highest. With this setting, we estimated eight unknown 
parameters in the regulation of genes LEU3 and ILV5. In this 
experiment, we did not use HEKF + MM, and we just compared 
PEDI with SRES (with 400 generations of evolution) and 
TDDM. This is because the dataset contains only one time 
course microarray data sample with eight time points, with 
which we could not satisfactorily estimate the covariance 
matrix that HEKF + MM demanded. To quantify the perform- 
ance, we measured the mean squared error of each estimate with 
respect to the observed mRNA data points, and we compared 
the best parameter solution from each method that gave the 
lowest error. We found that the error from the best solution of 
PEDI was 0.02, whereas the errors from the best solutions of 
SRES and TDDM were more than 0.2 and 0.3, respectively 
(Supplementary Table S4). This shows that PEDI was able to 
approximate the dynamics of the RNA copy of LEU3 and ILV5 
well. Indeed, the reconstructed dynamics from the best param- 
eter solution of PEDI shows a close agreement between the re- 
sults from PEDI and the time series microarray data (Fig. 5). The 
average computational time of PEDI, SRES and TDDM is 
3.3 min, 143.3 min and 1.7min, respectively. These results once 
again show that PEDI was able to generate a high-quality esti- 
mate efficiently. 

By using the best parameter solution from PEDI, we next 
analyzed regulatory mechanisms of this gene circuit. By looking 
at the parameters controlling the binding affinity of Gcn4p to the 
c/^-regulatory elements of gene LEU3 and gene ILV5, we found 
that Gcn4p had close to 20% higher binding affinity to the bind- 
ing site for gene ILV5. As the protein-DNA binding cooperativ- 
ity estimated by PEDI was high (Supplementary Table S4), we 
expect the change in transcription rates of ILV5 to be a switch- 
like, sigmoidal function of both Gcn4p and Leu3p. Thus, the 
differential binding affinity of Gcn4p allows for a delay in 




5 10 15 20 30 40 60 80 

Minutes 



Fig. 5. The reconstructed dynamics from the best parameter solution of 
PEDI for the yeast feed-forward loop model. The dynamics of mRNA 
GCN4 was estimated by a smooth piecewise polynomial function. PEDI 
predicted the dynamics of mRNAs, LEU3 and ILV5 

turning the ILV5 gene on after Gcn4p is turned on, as this 
AND-gate type transcription regulation of ILV5 in this model 
requires higher concentrations of both Gcn4p and Leu3p to turn 
ILV5 on. In the normal nonstarvation conditions, Gcn4p level is 
tightly controlled with a means of a rapid degradation through 
the ubiquitin pathway. On the other hand, Gcn4p level substan- 
tially increases in the amino acid starvation condition (Komitzer 
et aL, 1994). Our results show that the delay caused by the dif- 
ferential binding affinity in the feed-forward loop may serve as 
an extra layer of protection to ensure that this amino acid bio- 
synthesis pathway is only activated under the starvation condi- 
tion. As our hypothesis is based on one type of time series 
expression dataset with only eight time points, however, it 
needs to be taken with caution and further analysis is required 
to contrast it with alternative explanations. Indeed, our hypoth- 
esis can be validated experimentally; by changing the Gcn4p 
binding site for gene ILV5 to have a lower binding affinity, it 
predicts that the regulation of the amino acid biosynthesis path- 
way will be disrupted more easily. 

4 DISCUSSION 

The parameter estimation problem in modeling of biological sys- 
tems is challenging, as it usually involves many (often infinite) 
suboptimal solutions. Efficient and scalable parameter estima- 
tion is crucial to the systematic construction of quantitative 
models that support existing knowledge of complex biological 
systems and, more broadly, to the success of integrative systems 
biology going forward. Here, we have introduced a novel com- 
putational framework, which, instead of considering general ap- 
phcability, is customized especially for parameter estimation of 
gene circuit models. To see how PEDI performs in comparison 
with state-of-the-art parameter estimation methods, we applied 
SRES, HEKF + MM and TDDM to the parameter estimation 
of the same gene circuit models with the same datasets. We found 
that PEDI consistently gave the most accurate estimates in a 
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computationally efficient matter. To test how PEDI performs 
given experimental gene expression data, we applied it to model- 
ing of a yeast gene circuit from time series microarray data. We 
found that the reconstructed dynamics from PEDI closely agreed 
with the experimental data, and by analyzing the estimated par- 
ameter set, we were also able to make a testable hypothesis for an 
underlying regulatory mechanism of this gene circuit. 

Although PEDI can be applied to gene circuits with an arbi- 
trary size and degree of transcriptional interaction connectivity, 
there are some limitations. For example, PEDI cannot directly 
support gene regulatory models including transcriptional elong- 
ation and posttranscriptional modifications. Although we can 
relax the conditions of PEDI so as to support more generic bio- 
logical models by not requiring a model to have the form 
described by Equation (3), the efficiency and the accuracy of 
the model decomposition might decrease. While acknowledging 
the limited scope of the applicability, we believe that the value of 
a more tailored approach to the gene circuit domain far exceeds 
such potential drawbacks because of the importance of transcrip- 
tional regulation in quantitative understandings of cellular sys- 
tems. By narrowing down our focus to gene circuit models, our 
customized approach was able to display two main advantages 
over the general parameter estimation methods: (i) it can make 
more appropriate assumptions about the kinds of gene expres- 
sion data available for parameter estimation and (ii) it can ex- 
ploit the structural information on gene circuit models — in 
particular, statistical thermodynamic-based gene circuit models. 
An additional practical benefit of PEDI is that it is relatively easy 
to implement in a lower level language such as C. 
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